%%{init: {'theme': 'base', 'themeVariables': {
'background': '#FAFAF5',
'primaryColor': '#4682B4',
'secondaryColor': '#1E3A8A',
'lineColor': '#1E3A8A',
'nodeBorder': '#1E3A8A',
'primaryTextColor': '#FFFFFF',
'textColor': '#191970',
'fontSize': '12px',
'width': '100%'
}}}%%
flowchart TB
A["Data Preparation<br/>- Clean data<br/>- Encode categorical variables"] --> B["Exploratory Data Analysis<br/>- Check distributions<br/>- Identify predictors"]
B --> C["Split Data<br/>- Train/Test sets<br/>- Stratify by fraud outcome"]
C --> D["Specify GAM Model<br/>- Select predictors<br/>- Define smooth terms<br/>- Family = binomial"]
D --> E["Fit Model<br/>mgcv::gam(...)"]
E --> F["Evaluate Model<br/>- ROC/AUC<br/>- Confusion Matrix"]
F --> G["Interpret Results<br/>- Plot smooth effects"]
G --> H["Predict New Data<br/>- Apply model to test or new cases"]
style H fill:#FF4C4C,stroke:#8B0000,color:#FFFFFF
Generalized Additive Models in Fraud Detection and Pattern Recognition
Data Science Capstone Project
Introduction
Generalized Additive Models (GAMs) extend traditional regression methods in a way that balances predictive flexibility with interpretability. Introduced by Hastie & Tibshirani (1986, 1990), GAMs build on the framework of Generalized Linear Models (GLMs) by replacing the strictly linear predictor with a sum of smooth, data-driven functions. This structure lets models capture complex nonlinear relationships while remaining interpretable, which makes them especially valuable in fields where transparency is critical, including finance, healthcare, auditing, and cybersecurity. Because stakeholders and regulators can review each nonlinear effect directly, GAMs have become an important tool in modern statistical and machine learning applications.
The foundations of GAMs are grounded in penalized likelihood estimation and iteratively reweighted least squares (HalDa, 2012), while modern implementations such as the mgcv package in R (Wood, 2017, 2025) have greatly improved their efficiency, scalability, and robustness. Penalization techniques introduced by Wood (2017) allow smoothness control, prevent overfitting, and address issues such as concurvity, making GAMs well-suited for noisy or high-dimensional datasets. These developments have made GAMs increasingly practical for real-world applications. Transparency also remains central: as Zlaoui (2018) illustrates, GAMs provide interpretable risk curves that visualize how each feature influences an outcome, offering critical insight in high-stakes environments.
Applications of GAMs across different fields underscore their versatility. In ecology, they have been used to map species distributions and detect environmental thresholds (Detmer, 2025; Guisan et al., 2002). In biostatistics, they have informed studies of health outcomes such as alcohol use (White et al., 2020). In finance and auditing, GAMs have uncovered irregular revenue patterns and detected fraudulent Medicare billing, with results that auditors and regulators could interpret directly (Brossart et al., 2015; Miller, 2025). Even in challenging contexts where noisy or uneven data reduce precision, studies have shown that recall and interpretability remain strong advantages of the approach (Detmer, 2025; Guisan et al., 2002; Tragouda et al., 2024).
Building on these foundations, researchers have proposed several extensions and innovations. Functional and Dynamic GAMs account for functional predictors and temporal dependencies, enhancing model flexibility for forecasting and time-series applications (DGAM, 2021; FGAM, 2015). Neural-inspired variants such as Neural Additive Models (Agarwal et al., 2021) and GAMformer (GAMformer, 2023) integrate deep learning techniques, improving computational efficiency and extending the ability of GAMs to model complex nonlinear data. Bayesian approaches provide clearer ways to quantify uncertainty and guide variable selection (Miller, 2025). Other tools such as Gam.hp (2020) strengthen transparency by quantifying predictor contributions. Furthermore, Microsoft’s Explainable Boosting Machine explored by Lou et al. (2012) adapts the GAM framework to include limited interactions, improving predictive performance while retaining interpretability.
Research also highlights the role of GAMs within broader fraud detection strategies. In financial contexts, Tragouda et al. (2024) applied GAMs to bank cheque fraud, demonstrating high recall (77.8%) even when data imbalance reduced precision. Brossart et al. (2015) used GAMs to identify fraudulent Medicare billing, showing that interpretability helped build auditor trust despite challenges with adapting to emerging patterns. Miller (2025) combined GAMs with ensemble models such as random forests to detect irregular revenue in financial statements, producing visualizations auditors could use directly. Beyond GAMs, graph-based frameworks have emerged as complementary approaches. For example, Chang et al. (2022) introduced Graph Neural Additive Networks (GNANs), extending GAMs to graph-structured data such as transaction networks and achieving 84.5% ROC-AUC in detecting suspicious users. Zhang et al. (2025) demonstrated that GAMs could model sequential features in telecom fraud detection but were often outperformed by graph neural networks (GNNs) when modeling complex relational data.
In parallel, other interpretable machine learning techniques continue to shape the fraud detection landscape. Hanagandi et al. (2023) applied regularized generalized linear models, including Ridge, Lasso, and ElasticNet, to highly imbalanced credit card fraud datasets, achieving strong performance (up to 98.2% accuracy with Ridge regression) and showing that careful preprocessing is essential for real-time fraud detection. Generative approaches also contribute: Zhu et al. (2023) demonstrated how Generative Adversarial Networks (GANs) can generate synthetic transaction data to improve robustness against class imbalance. Collectively, these innovations expand the interpretability-performance frontier and highlight how transparent modeling frameworks, including GAMs and their extensions, remain central to modern fraud analytics.
The primary objectives of this analysis are to leverage the fraud detection transactions dataset to build and evaluate effective fraud detection models using Generalized Additive Models (GAMs). Specifically, the goals are:
Develop Robust Models: Construct models that accurately distinguish between fraudulent and legitimate transactions using GAMs.
Identify Key Features: Pinpoint significant variables that contribute to fraud risk, improving interpretability and providing actionable insights for financial institutions.
Provide Practical Insights: Generate findings that enhance anomaly detection, risk management, and financial security strategies, while addressing challenges such as noise and class imbalance.
In this study, we apply GAM methodology using RStudio and the mgcv package to the Fraud Detection Transactions Dataset from Kaggle (Ashar, 2024). This synthetic yet realistic dataset provides an opportunity to test GAMs in a controlled but meaningful context. Our aim is to evaluate whether GAMs can balance predictive strength with interpretability, creating models that are both accurate and transparent for fraud detection.
Methods
Generalized Additive Models (GAMs) extend traditional regression by allowing flexible, nonlinear relationships between predictors and the response variable. In the context of fraud detection, GAMs model the probability that a transaction is fraudulent as a smooth and interpretable function of key predictors such as transaction amount, account activity, and time of day. Continuous variables are represented with spline-based smooth functions to capture nonlinear patterns, while categorical variables are incorporated as factors. The model is fitted using the mgcv package in R, which applies penalized regression splines and generalized cross-validation (GCV) to optimize smoothness and prevent overfitting (Wood, 2017). After fitting, the smooth terms illustrate how each variable influences fraud likelihood, enabling visual interpretation of complex effects. Model performance is then evaluated using metrics such as AUC, accuracy, and recall, and the trained model is applied to the test dataset to identify fraudulent transactions.
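The workflow described above can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the project's fitted model: the data frame `sim`, its columns (`amount`, `hour`, `device`, `fraud`), the simulated risk curve, the 70/30 split, and the base-rate classification threshold are all hypothetical choices made for the example. Only `mgcv::gam()`, smooth terms via `s()`, and `family = binomial` are taken from the text.

```r
library(mgcv)

set.seed(42)
n <- 4000

# Hypothetical stand-in data (NOT the Kaggle dataset)
sim <- data.frame(
  amount = rlnorm(n, meanlog = 4, sdlog = 1),
  hour   = runif(n, 0, 24),
  device = factor(sample(c("Laptop", "Mobile", "Tablet"), n, replace = TRUE))
)
# Assumed truth: risk rises with log-amount and peaks around midnight
eta <- -5 + 0.6 * log(sim$amount) + 1.2 * cos(2 * pi * sim$hour / 24)
sim$fraud <- rbinom(n, 1, plogis(eta))

# Stratified 70/30 split: sample row indices within each outcome class
train_idx <- unlist(
  lapply(split(seq_len(n), sim$fraud),
         function(i) sample(i, floor(0.7 * length(i)))),
  use.names = FALSE
)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Smooth terms for continuous predictors, a factor for the categorical one
fit <- gam(fraud ~ s(amount) + s(hour) + device,
           family = binomial, data = train)

# Predicted fraud probabilities on the held-out set
p_hat <- as.numeric(predict(fit, newdata = test, type = "response"))

# AUC via the rank (Mann-Whitney) formula, to avoid extra packages
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_val <- auc(test$fraud, p_hat)

# Classify at the training base rate (an assumed threshold, not a tuned one)
threshold  <- mean(train$fraud)
pred_class <- as.integer(p_hat >= threshold)
conf_mat   <- table(actual = test$fraud, predicted = pred_class)
recall     <- sum(pred_class == 1 & test$fraud == 1) / sum(test$fraud == 1)

print(auc_val)
print(conf_mat)
print(recall)
```

With real data, the threshold would normally be chosen from the cost of missed fraud versus false alarms rather than fixed in advance; the base rate is used here only to keep the example short.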
The overall modeling process, from data preparation through model evaluation and interpretation, is summarized in the accompanying flow chart.
Equation
Formally, a GAM can be expressed as:
\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]
where \(\mu\) is the expected value of the response, \(g\) is the link function (e.g., logit for binary outcomes or identity for continuous outcomes), \(\alpha\) is the intercept, and \(s_j(X_j)\) are smooth functions of the predictor variables \(X_j\). This structure allows each predictor to contribute a smoothed effect to the model, capturing complex patterns in the data without obscuring the individual influence of each variable. By balancing flexibility and clarity, GAMs offer a practical alternative to fully nonparametric methods, which can become computationally intensive and difficult to interpret.
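For the binary fraud outcome used in this project, the general form specializes to the logit link. The symbol \(\pi(\mathbf{X})\) for the fraud probability is notation introduced here for illustration and does not appear elsewhere in the report:
\[ \log\frac{\pi(\mathbf{X})}{1 - \pi(\mathbf{X})} = \alpha + \sum_{j=1}^{p} s_j(X_j), \qquad \pi(\mathbf{X}) = \frac{1}{1 + \exp\left\{-\left(\alpha + \sum_{j=1}^{p} s_j(X_j)\right)\right\}} \]
Each smooth is estimated by maximizing a penalized log-likelihood. A common form of the penalty, assuming a second-derivative roughness measure for each one-dimensional smooth, is:
\[ \ell_{\text{pen}} = \ell(\alpha, s_1, \dots, s_p) - \frac{1}{2} \sum_{j=1}^{p} \lambda_j \int s_j''(x)^2 \, dx \]
where \(\ell\) is the binomial log-likelihood and each smoothing parameter \(\lambda_j \ge 0\) controls how strongly wiggliness in \(s_j\) is penalized. As \(\lambda_j \to \infty\) the term \(s_j\) approaches a straight line, and as \(\lambda_j \to 0\) it approaches an unpenalized spline fit.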
Assumptions
The model assumes a link function that connects the expected response to the additive predictor, so the relationship is additive (not necessarily linear) on the link scale. For fraud detection, this usually means using a logit link to model the probability that a transaction is fraudulent.
The effects of the predictors are additive. Each variable adds its own influence, and the total prediction is the sum of those parts.
Observations are independent, meaning one transaction does not affect another. Each case stands on its own.
The model assumes smooth changes in the relationships. When a predictor changes, its effect on fraud risk changes gradually, not suddenly.
The response variable follows a known distribution. For this project, it is assumed to be binomial since the outcome is either fraud or not fraud.
The smoothness settings and penalty values are chosen so the model captures real trends without overfitting the data.
Predictors are assumed to not be too strongly correlated with each other so the model can estimate each variable’s effect clearly.
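Two of the assumptions above (weakly related predictors and adequate smoothness settings) can be checked after fitting. The sketch below uses simulated data in which `x2` is deliberately constructed from `x1`. The variable names and the simulated relationship are assumptions for illustration, while `concurvity()` and `k.check()` are diagnostic functions provided by mgcv.

```r
library(mgcv)

set.seed(1)
n  <- 2000
x1 <- runif(n)
x2 <- 0.7 * x1 + 0.3 * runif(n)   # deliberately related to x1
x3 <- runif(n)                     # unrelated predictor
y  <- rbinom(n, 1, plogis(-2 + 3 * sin(2 * pi * x1) + 2 * x3))

fit <- gam(y ~ s(x1) + s(x2) + s(x3), family = binomial, method = "REML")

# Concurvity: values near 1 mean a smooth can be approximated by the others
conc <- concurvity(fit, full = TRUE)
print(round(conc, 2))

# Basis-dimension check: compares k' with the effective degrees of freedom
k_tab <- k.check(fit)
print(k_tab)
```

In this setup the concurvity rows for `s(x1)` and `s(x2)` should be noticeably higher than for `s(x3)`, which is the pattern to look for among real predictors such as balances and transaction amounts.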
Sample Data
Analysis and Results
Data Exploration and Visualization
Data set Description
The Fraud Detection Transactions Dataset (Ashar, 2024) is a synthetic dataset that replicates real-world financial transaction patterns, making it a useful resource for building and testing fraud detection models. Hosted on Kaggle, it is tailored for binary classification tasks, with transactions labeled as fraudulent (1) or non-fraudulent (0), and it contains no real user information, so it can be used without privacy concerns. Its design captures fraud patterns such as clustered fraudulent transactions, subtle anomalies, and irregular user behaviors, providing a challenging yet representative environment for machine learning applications in anomaly detection, risk assessment, and fraud prevention.
The dataset contains 50,000 transactions, mixing typical transactions with rare fraudulent events so that class imbalance, a common challenge in fraud detection, is represented. Potential data quality issues, such as noisy data, missing values, or outliers, reflect real-world complexities and require preprocessing steps like data cleaning, categorical encoding, or normalization. These challenges call for modeling techniques that tolerate noise and still produce accurate predictions.
Key Characteristics
The dataset includes 50,000 rows and 21 features, categorized as follows:
Size and Scope: Contains 50,000 individual transactions, each labeled as either fraudulent (1) or non-fraudulent (0).
Features (21 total):
Numerical variables: transaction amounts, risk scores, balances, and other continuous measures.
Categorical variables: transaction types (e.g., payment, transfer, withdrawal), device types, and merchant categories.
Temporal variables: transaction time, day, and sequencing patterns that capture behavioral dynamics.
Label Distribution: Fraudulent transactions represent a small percentage of the data, reflecting the real-world class imbalance in fraud detection problems.
Realism: Although synthetic, the dataset mirrors real-world fraud scenarios by including behavioral signals, unusual spending patterns, and high-risk profiles.
Flexibility: Supports various modeling approaches, from interpretable methods (e.g., GAMs, logistic regression) to high-performance ensemble models (e.g., XGBoost).
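The label distribution mentioned above is worth quantifying before modeling. The sketch below shows one way to do so in base R. The vector `fraud_label` holds made-up toy values, not the dataset's labels, so the printed proportions say nothing about the real data.

```r
# Toy labels standing in for the fraud column (hypothetical values)
fraud_label <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0)

counts <- table(fraud_label)       # number of cases per class
props  <- prop.table(counts)       # share of each class

# Legitimate-to-fraud ratio: how many legitimate cases per fraud case
imbalance_ratio <- unname(counts["0"] / counts["1"])

print(counts)
print(round(props, 3))
print(imbalance_ratio)
```

On the project data, the same three lines could be applied to the `Fraud_Label` column used in the plotting code later in this report.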
Visualizations
Code
# Load libraries
library(tidyverse)
library(janitor)
library(gt)
library(scales)
# === Load dataset ===
data_path <- "synthetic_fraud_dataset.csv"
df <- readr::read_csv(data_path, show_col_types = FALSE) |>
clean_names()
# === Create count tables ===
tbl_type <- df |>
count(transaction_type, name = "Count") |>
arrange(desc(Count)) |>
rename(Type = transaction_type)
tbl_device <- df |>
count(device_type, name = "Count") |>
arrange(desc(Count)) |>
rename(Device = device_type)
tbl_merchant <- df |>
count(merchant_category, name = "Count") |>
arrange(desc(Count)) |>
rename(Merchant_Category = merchant_category)
# === Blue Theme for gt Tables ===
style_blue_gt <- function(.data, title_text) {
.data |>
gt() |>
tab_header(title = md(title_text)) |>
fmt_number(columns = "Count", decimals = 0, sep_mark = ",") |>
tab_options(
table.font.names = "Arial",
table.font.size = 14,
data_row.padding = px(6),
heading.align = "left",
table.border.top.color = "darkblue",
table.border.top.width = px(3),
table.border.bottom.color = "darkblue",
table.border.bottom.width = px(3)
) |>
tab_style(
style = list(cell_fill(color = "darkblue"),
cell_text(color = "white", weight = "bold")),
locations = cells_title(groups = "title")
) |>
tab_style(
style = list(cell_fill(color = "steelblue"),
cell_text(color = "white", weight = "bold")),
locations = cells_column_labels(everything())
) |>
opt_row_striping() |>
cols_align("right", columns = "Count")
}
# === Render all three blue tables ===
style_blue_gt(tbl_type, "Table 1 – Transaction Types and Counts")
Table 1 – Transaction Types and Counts

| Type | Count |
|---|---:|
| POS | 12,549 |
| Online | 12,546 |
| ATM Withdrawal | 12,453 |
| Bank Transfer | 12,452 |

Code
style_blue_gt(tbl_device, "Table 2 – Device Types and Counts")
Table 2 – Device Types and Counts

| Device | Count |
|---|---:|
| Tablet | 16,779 |
| Mobile | 16,640 |
| Laptop | 16,581 |

Code
style_blue_gt(tbl_merchant, "Table 3 – Merchant Categories and Counts")
Table 3 – Merchant Categories and Counts

| Merchant_Category | Count |
|---|---:|
| Clothing | 10,033 |
| Groceries | 10,019 |
| Travel | 10,015 |
| Restaurants | 9,976 |
| Electronics | 9,957 |

Categorical Variable Count Tables
These tables display the counts for our categorical variables. While the dataset is synthetic and the categories are relatively evenly distributed, generalized additive models (GAMs) remain an appropriate analytical approach. GAMs provide the flexibility to model complex, nonlinear relationships between predictors and outcomes, accommodating both categorical and continuous variables. The even distribution of categories in the synthetic data does not compromise the validity of GAMs; it primarily affects the interpretability of specific category effects rather than the model’s overall applicability. Therefore, GAMs can still yield meaningful insights into the underlying patterns and relationships within this dataset.
Code
# Load libraries
library(ggplot2)
library(dplyr)
library(tidyr) # For pivot_longer
library(gridExtra) # For arranging plots
#install.packages("moments")
library(moments) # For skewness and kurtosis
Code
library(tidyverse)
library(lubridate)
library(patchwork) # for arranging multiple ggplots
# Load dataset
fraud_data <- read.csv("synthetic_fraud_dataset.csv")
# Convert Timestamp to date and calculate Issuance_Year if needed
fraud_data <- fraud_data %>%
mutate(
Timestamp = ymd_hms(Timestamp, quiet = TRUE), # adjust format if needed
Transaction_Year = year(Timestamp),
Issuance_Year = Transaction_Year - Card_Age
) %>%
filter(!is.na(Card_Age)) # remove rows with NA in Card_Age
# Variables to plot (move Transaction_Amount to last)
numeric_vars <- c("Account_Balance", "Transaction_Distance", "Risk_Score", "Card_Age", "Transaction_Amount")
# Create a list to store plots
plot_list <- list()
# Generate plots and store in the list
for (var in numeric_vars) {
# aes_string() is deprecated; the .data pronoun looks up a column by its string name
p <- ggplot(fraud_data, aes(x = .data[[var]])) +
geom_histogram(fill = "steelblue", color = "white", bins = 30) +
labs(title = paste("Distribution of", var),
x = var,
y = "Count") +
theme_light()
plot_list[[var]] <- p
}
# Arrange plots in a grid: 2 plots per row
(plot_list[[1]] | plot_list[[2]]) /
(plot_list[[3]] | plot_list[[4]]) /
plot_list[[5]] # Transaction_Amount appears last
Distribution of Numeric Variables
The transaction amount histogram shows a strongly right-skewed distribution. Most transactions involve small amounts, while a few high-value transactions sit in the far right tail. The histogram alone does not show where fraud occurs, but it raises the possibility that fraudulent behavior clusters around extreme transaction amounts. The skewness suggests that a log transformation or nonlinear modeling (via a GAM) can help stabilize variance and capture a curved fraud risk pattern across transaction sizes.
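The effect of a log transformation on skewness can be checked numerically. The sketch below uses simulated log-normal amounts as a stand-in for `Transaction_Amount`; the distribution parameters and the helper `skew()` are assumptions for illustration. The `moments` package loaded earlier provides a `skewness()` function for the same purpose.

```r
set.seed(7)
# Simulated right-skewed amounts (hypothetical, not the dataset's values)
amount <- rlnorm(5000, meanlog = 4, sdlog = 1)

# Moment-based sample skewness: third central moment over sd cubed
skew <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skew(amount)          # strongly positive for right-skewed data
skew_log <- skew(log1p(amount))   # close to zero after the transform

print(c(raw = skew_raw, log1p = skew_log))
```

`log1p()` is used instead of `log()` so that zero amounts, if present, do not produce `-Inf`.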
Code
ggplot(fraud_data, aes(x = as.factor(Fraud_Label), y = Risk_Score, fill = as.factor(Fraud_Label))) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("0" = "steelblue", "1" = "red"),
name = "Fraud Label",
labels = c("Legit", "Fraud")) +
labs(title = "Distribution of Risk Scores by Fraud Label",
x = "Fraud Label",
y = "Risk Score") +
theme_light() +
theme(legend.position = "none")
Distribution of Risk Scores
The boxplot shows the distribution of Risk_Score for fraudulent versus legitimate transactions. Fraudulent transactions generally have higher scores, with a higher median and upper quartile, while legitimate transactions cluster at lower values. This suggests that Risk_Score is a meaningful feature for distinguishing fraud. Using a GAM, we can formally test how Risk_Score relates to fraud, capturing potential non-linear effects in the data.
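The formal test mentioned above can be sketched as a single-smooth GAM. The data here are simulated: `risk_score`, `fraud`, and the hinge-shaped true relationship are assumptions made for the example, so the output does not describe the project dataset.

```r
library(mgcv)

set.seed(11)
n <- 3000
risk_score <- runif(n)   # hypothetical score in [0, 1]
# Assumed truth: risk is flat below 0.5, then climbs steeply
fraud <- rbinom(n, 1, plogis(-3 + 10 * pmax(risk_score - 0.5, 0)))

fit_rs <- gam(fraud ~ s(risk_score), family = binomial, method = "REML")

# Approximate significance of the smooth term
# (columns: edf, Ref.df, Chi.sq, p-value)
s_tab <- summary(fit_rs)$s.table
print(s_tab)
```

An effective degrees of freedom (edf) value well above 1 indicates that the fitted effect is curved rather than linear. Calling `plot(fit_rs, trans = plogis, shift = coef(fit_rs)[1])` would draw the fitted curve on the probability scale.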
Code
library(tidyverse)
library(lubridate)
# Load dataset
fraud_data <- read.csv("synthetic_fraud_dataset.csv")
# Convert Timestamp to date, calculate Transaction Year and Issuance Year, exclude NAs
fraud_data <- fraud_data %>%
mutate(
Timestamp = ymd_hms(Timestamp), # adjust if format differs
Transaction_Year = year(Timestamp),
Issuance_Year = Transaction_Year - Card_Age
) %>%
filter(!is.na(Issuance_Year), !is.na(Card_Age)) # remove rows with NA
# Bin Issuance Year into 5-year ranges and drop unused NA factor levels
fraud_data <- fraud_data %>%
mutate(
Issuance_Year_Bin = cut(Issuance_Year,
breaks = seq(2000, 2025, by = 5),
right = FALSE,
labels = c("2000-2004","2005-2009","2010-2014","2015-2019","2020-2024"))
) %>%
filter(!is.na(Issuance_Year_Bin)) # drop any rows that fall outside the bins
# Histogram
ggplot(fraud_data, aes(x = Issuance_Year_Bin)) +
geom_bar(fill = "steelblue", color = "white") +
labs(title = "Card Age Distribution by Issuance Year Range",
x = "Card Issuance Year Range",
y = "Count") +
theme_light()
Distribution of Card Age
The issuance-year distribution is left-skewed (equivalently, card age is right-skewed): many cards are relatively new, with fewer older cards. Older cards (e.g., issued in 2015–2017) may be more vulnerable if security features are outdated. Newer cards (e.g., 2023–2024) might show different usage patterns, possibly more digital or mobile transactions. Peaks in certain years could reflect onboarding campaigns or fraud targeting specific cohorts. This suggests that fraud risk may vary by card maturity: new cards could face higher risk due to unfamiliar usage patterns. A GAM's smooth terms can model such non-monotonic age–fraud relationships.
Code
library(tidyverse)
# Load dataset
fraud_data <- read.csv("synthetic_fraud_dataset.csv")
# Ensure Fraud_Label is numeric (0/1)
fraud_data <- fraud_data %>%
mutate(Fraud_Label = as.numeric(Fraud_Label))
# Nonlinearity check: Transaction Amount vs Fraud Probability
ggplot(fraud_data, aes(x = Transaction_Amount, y = Fraud_Label)) +
geom_smooth(method = "loess", se = FALSE, color = "darkblue") +
labs(title = "Relationship Between Transaction Amount and Fraud Probability",
x = "Transaction Amount",
y = "Fraud Probability") +
theme_light()
Non-linearity Check
The plot shows a nonlinear relationship between transaction amount and fraud probability, supporting the use of GAMs to model such effects flexibly. Transaction amount is a key continuous predictor, and its curved relationship illustrates the need for a flexible approach before analyzing the full set of variables.
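A smoother-free cross-check of the same relationship is to compute the observed fraud rate within bins of the predictor. The sketch below does this by decile on simulated data. The variables `amount` and `fraud` and the assumed log-linear risk are hypothetical and stand in for `Transaction_Amount` and `Fraud_Label`.

```r
set.seed(3)
n <- 5000
amount <- rlnorm(n, meanlog = 4, sdlog = 1)              # simulated amounts
fraud  <- rbinom(n, 1, plogis(-4 + 0.8 * log(amount)))   # assumed relationship

# Cut amounts into deciles and compute the observed fraud rate per bin
breaks <- quantile(amount, probs = seq(0, 1, by = 0.1))
bin    <- cut(amount, breaks = breaks, include.lowest = TRUE)

binned <- data.frame(
  mid_amount = as.vector(tapply(amount, bin, median)),
  fraud_rate = as.vector(tapply(fraud, bin, mean)),
  n          = as.vector(table(bin))
)
print(binned)
```

If the binned rates change unevenly across deciles, for example flat at first and then rising sharply, that supports using a smooth term instead of a single linear coefficient.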